


PCT

WORLD INTELLECTUAL PROPERTY ORGANIZATION

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

G10L

A2



(11) International Publication Number: WO 00/46787

(43) International Publication Date: 10 August 2000 (10.08.00)

(21) International Application Number: PCT/US00/02808

(22) International Filing Date: 4 February 2000 (04.02.00)

(30) Priority Data:
60/116,949    5 February 1999 (05.02.99)    US



(71) Applicant: CUSTOM SPEECH USA, INC. [US/US]; Suite B365, 3 North Court
Street, Crown Point, IN 46307 (US).

(72) Inventors; and
[illegible] Cheyenne Drive, Crown Point, IN 46307 (US). QIN, Charles [-/US];
23461 North [illegible] Lane, Lake Zurich, IL 60047 (US). FLYNN, Thomas, P. [US/US];
562 Ridgelawn Road, Crown Point, IN 46307 (US). [illegible]

(74) Agents: SIGALE, Jordan, A. et al.; Sonnenschein, Nath & Rosenthal,
8000 Sears Tower, 233 S. Wacker Drive, Chicago, IL 60606-6404 (US).



(81) Designated States: AT, AU, AZ, BA, BB, BG, [...] CU, CZ, DE, DK, DM, EE, [...]
KG, KP, KR, KZ, LC, LK, LR, LS, LT, LU, LV, MA, [...] SD, SE, SG, SI, SK, SL, TJ, TM,
TR, TT, TZ, UA, UG, [...] ARIPO patent ([...] UG, ZW), Eurasian patent (AM, [...] TJ,
TM), European patent (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, [...] LU, [...]),
OAPI patent (BF, BJ, CF, CG, CI, CM, GA, GN, GW, ML, MR, NE, SN, TD, TG).

Published

Without international search report and to be republished
upon receipt of that report.



[Figure: block diagram showing digital audio recording stations and a digital audio player]



(54) Title: SYSTEM AND METHOD FOR AUTOMATING TRANSCRIPTION SERVICES

(57) Abstract

A system for substantially automating transcription services for multiple voice
users including a manual transcription station, a speech recognition program and a
routing program. The system establishes a profile for each of the voice users
containing a training status which is selected from the group of enrollment, training,
automated and stop automation. The system generates a uniquely identified voice
dictation file from a current voice user and, based on the training status, the system
routes the uniquely identified voice dictation [...] received voice dictation file. The
speech recognition program automatically creates a [...] is trained with an acoustic
model for the [...]

SYSTEM AND METHOD FOR AUTOMATING TRANSCRIPTION SERVICES

Background of the Invention

1. Field of the Invention

The present invention relates in general to computer speech recognition systems
and, in particular, to a system and method for automating the text transcription of voice
dictation by various end users.

2. Background Art 

Speech recognition programs are well known in the art. While these programs
are ultimately useful in automatically converting speech into text, many users are
dissuaded from using these programs because they require each user to spend a
significant amount of time training the system. Usually this training begins by having
each user read a series of pre-selected materials for approximately 20 minutes. Then,
as the user continues to use the program and words are improperly transcribed, the
user is expected to stop and train the program as to the intended word, thus advancing
the ultimate accuracy of the acoustic model. Unfortunately, most professionals (doctors,
dentists, veterinarians, lawyers) and business executives are unwilling to spend the time
developing the necessary acoustic model to truly benefit from the automated
transcription.

Accordingly, it is an object of the present invention to provide a system that
offers transparent training of the speech recognition program to the end-users.

There are systems for using computers for routing transcription from a group of
end users. Most often these systems are used in large multi-user settings such as
hospitals. In those systems, a voice user dictates into a general-purpose computer or
other recording device and the resulting file is transferred automatically to a human
transcriptionist. The human transcriptionist transcribes the file, which is then returned to
the original "author" for review. These systems have the perpetual overhead of
employing a sufficient number of human transcriptionists to transcribe all of the dictation
files.

Accordingly, it is another object of the present invention to provide an automated
means of translating speech into text wherever suitable so as to minimize the number of
human transcriptionists necessary to transcribe audio files coming into the system.

It is an associated object to provide a simplified means for providing verbatim
text files for training a user's acoustic model for the speech recognition portion of the
system.

It is another associated object of the present invention to automate a preexisting
speech recognition program toward further minimizing the number of operators necessary
to operate the system.

These and other objects will be apparent to those of ordinary skill in the art
having the present drawings, specification and claims before them.

Summary of the Invention

The present invention comprises, in part, a system for substantially automating
transcription services for one or more voice users. The system includes means for
creating a uniquely identified voice dictation file from a current user and an audio player
used to audibly reproduce said uniquely identified voice dictation file. Both of these
system elements can be implemented on the same or different general-purpose
computers. Additionally, the voice dictation file creating means includes a system for
assigning unique file handles to audio files and an audio recorder, and may further comprise
means for operably connecting to a separate digital recording device and/or means for
reading audio files from removable magnetic and other computer media.

Each of the general-purpose computers implementing the system may be
remotely located from the other computers but in operable connection to each other by
way of a computer network, direct telephone connection, via email or other Internet-
based transfer.

The system further includes means for manually inputting and creating a
transcribed file based on humanly perceived contents of the uniquely identified voice
dictation file. Thus, for certain voice dictation files, a human transcriptionist manually
transcribes a textual version of the audio, using a text editor or word processor, based
on the output of the audio player.

The system also includes means for automatically converting the voice dictation
file into written text. The automatic speech converting means may be a preexisting
speech recognition program, such as Dragon Systems' Naturally Speaking, IBM's Via
Voice or Philips Corporation's Speech Magic. In such a case, the automatic speech
converting means includes means for automating responses to a series of interactive
inquiries from the preexisting speech recognition program. In one embodiment, the
system also includes means for manually selecting a specialized language model.

The system further includes means for manually editing the resulting written text
to create a verbatim text of the voice dictation file. At the outset of a user's use of the
system, this verbatim text will have to be created completely manually. However, after
the automatic speech converting means has begun to sufficiently develop that user's
acoustic model, a more automated means can be used.

In a preferred embodiment, that manual editing means includes means for
sequentially comparing a copy of the written text with the transcribed file resulting in a
sequential list of unmatched words culled from the copy of said written text. The manual
editing means further includes means for incrementally searching for the current
unmatched word contemporaneously within a first buffer associated with the speech
recognition program containing the written text and a second buffer associated with the
sequential list. Finally, the preferred manual editing means includes means for
correcting the current unmatched word in the second buffer, which includes means for
displaying the current unmatched word in a manner substantially visually isolated from
other text in the written text and means for playing a portion of the voice dictation
recording from said first buffer associated with said current unmatched word. In one
embodiment, the manual input means further includes means for alternatively viewing
the current unmatched word in context within the written text. For instance, the operator
may wish to view the unmatched word within the sentence in which it appears or perhaps
with only its immediately adjacent words. Thus, the manner of substantial visual isolation
can be manually selected from the group containing word-by-word display, sentence-by-
sentence display, and said current unmatched word display. The manual editing means
portion of the complete system may also be utilized as a separate apparatus.
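
The sequential comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation of the manual editing means: it assumes simple whitespace tokenization and uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever comparison the system performs. The function name `unmatched_words` is hypothetical.

```python
from difflib import SequenceMatcher


def unmatched_words(written_text: str, transcribed_text: str) -> list[str]:
    """Return, in order, the words of written_text with no match in transcribed_text."""
    written = written_text.split()
    transcribed = transcribed_text.split()
    matcher = SequenceMatcher(a=written, b=transcribed, autojunk=False)
    unmatched = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" mark spans of the written text that differ
        # from the transcribed file; "insert" spans exist only in the latter.
        if tag in ("replace", "delete"):
            unmatched.extend(written[i1:i2])
    return unmatched
```

An operator-facing tool could then step through the returned list one word at a time, which corresponds to the "sequential list" the passage describes.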

The system may also include means for determining the skill of a human
transcriptionist. In one approach, this accuracy determination can be made by
determining the ratio of the number of words in the sequential list of unmatched words to
the number of words in the written text.
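
As a small worked example of this ratio, the sketch below divides the length of the unmatched-word list by the word count of the written text. The function name `unmatched_ratio` and the whitespace word count are assumptions made for illustration.

```python
def unmatched_ratio(unmatched: list[str], written_text: str) -> float:
    """Ratio of unmatched words to the total number of words in the written text."""
    total = len(written_text.split())
    if total == 0:
        raise ValueError("written text contains no words")
    return len(unmatched) / total
```

For a written text of four words with one unmatched word, the ratio is 1/4; a lower ratio means the two texts agree more closely.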

The system additionally includes means for training the automatic speech
converting means to achieve higher accuracy for the current user. In particular, the
training means uses the verbatim text created by the manual editing means and the voice
dictation file. The training means may also comprise a preexisting training portion of the
preexisting speech recognition program. Thus, the training means would also include
means for automating responses to a series of interactive inquiries from the preexisting
training portion of the speech recognition program. This functionality can be used, for
instance, to establish a new language model (i.e. foreign language).

The system finally includes means for controlling the flow of the voice dictation
file based upon the training status of the current user using the unique identification. The
control means reads and modifies a user's training status such that it is an appropriate
selection from the group of pre-enrollment, enrollment, training, automation and stop
automation. During a user's pre-enrollment phase the control means further includes
means for creating a user identification and acoustic model within the automatic speech
converting means. During this phase, the control means routes the voice dictation file to
the automatic speech converting means and the manual input means, routes the written
text and the transcribed file to the manual editing means, routes the verbatim text to the
training means and routes the transcribed file back to the current user as a finished text.

During the training phase, the control means (1) routes the voice dictation file to
the automatic speech converting means and the manual input means, (2) routes the
written text and the transcribed file to the manual editing means, (3) routes the verbatim
text to the training means and (4) routes the transcribed file back to the current user as a
finished text.

During the automation stage, the control means routes (1) the voice dictation file
only to the automatic speech converting means and (2) the written text back to the
current user as a finished text.



The present application also discloses a method for automating transcription
services for one or more voice users in a system including a manual transcription station
and a speech recognition program. The method comprises the steps of: (1) establishing
a profile for each of the voice users, the profile containing a training status; (2) creating a
uniquely identified voice dictation file from a current voice user; (3) choosing the
training status of the current voice user from the group of enrollment, training, automated
and stop automation; (4) routing the voice dictation file to at least one of the manual
transcription station and the speech recognition program based on the training status; (5)
receiving the voice dictation file in at least one of the manual transcription station and
the speech recognition program; (6) creating a transcribed file at the manual transcription
station for each received voice dictation file; (7) automatically creating a written text
with the speech recognition program for each received voice dictation file if the training
status of the current user is training or automated; (8) manually establishing a verbatim
file if the training status of the current user is enrollment or training; (9) training the
speech recognition program with an acoustic model for the current user using the
verbatim file and the voice dictation file if the training status of the current user is
enrollment or training; (10) returning the transcribed file to the current user if the
training status of the current user is enrollment or training; and (11) returning the written
text to the current user if the training status of the current user is automated.
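
The status-dependent steps (7) through (11) can be summarized as a lookup from training status to the actions that apply. The sketch below is one possible reading of those conditions, not a definitive implementation; the action names and the function `steps_for` are hypothetical labels introduced for illustration.

```python
# Conditions copied from steps (7)-(11) of the method; the keys are
# hypothetical names for those steps.
STEP_CONDITIONS = {
    "create_written_text": {"training", "automated"},       # step (7)
    "establish_verbatim_file": {"enrollment", "training"},  # step (8)
    "train_acoustic_model": {"enrollment", "training"},     # step (9)
    "return_transcribed_file": {"enrollment", "training"},  # step (10)
    "return_written_text": {"automated"},                   # step (11)
}

VALID_STATUSES = {"enrollment", "training", "automated", "stop automation"}


def steps_for(status: str) -> list[str]:
    """List the conditional steps that apply to a user with the given training status."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown training status: {status!r}")
    return [step for step, statuses in STEP_CONDITIONS.items() if status in statuses]
```

Under this reading none of the five conditional steps applies to a user whose status is "stop automation", since no step in the list names that status.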

Brief Description of the Drawings

Fig. 1 of the drawings is a block diagram of one potential embodiment of the
present system for substantially automating transcription services for one or more voice
users;

Fig. 1b of the drawings is a block diagram of a general-purpose computer which
may be used as a dictation station, a transcription station and the control means within
the present system;

Fig. 2a of the drawings is a flow diagram of the main loop of the control means
of the present system;

Fig. 2b of the drawings is a flow diagram of the enrollment stage portion of the
control means of the present system;

Fig. 2c of the drawings is a flow diagram of the training stage portion of the
control means of the present system;

Fig. 2d of the drawings is a flow diagram of the automation stage portion of the
control means of the present system;

Fig. 3 of the drawings is a directory structure used by the control means in the
present system;

Fig. 4 of the drawings is a block diagram of a portion of a preferred embodiment
of the manual editing means; and

Fig. 5 of the drawings is an elevation view of the remainder of a preferred
embodiment of the manual editing means.

Best Modes of Practicing the Invention

While the present invention may be embodied in many different forms, there is
shown in the drawings and discussed herein a few specific embodiments with the
understanding that the present disclosure is to be considered only as an exemplification
of the principles of the invention and is not intended to limit the invention to the
embodiments illustrated.

Fig. 1 of the drawings generally shows one potential embodiment of the present
system for substantially automating transcription services for one or more voice users.
The present system must include some means for receiving a voice dictation file from a
current user. This voice dictation file receiving means can be a digital audio recorder, an
analog audio recorder, or standard means for receiving computer files on magnetic media
or via a data connection.

As shown, in one embodiment, the system 100 includes multiple digital recording
stations 10, 11, 12 and 13. Each digital recording station has at least a digital audio
recorder and means for identifying the current voice user.






Preferably, each of these digital recording stations is implemented on a general-
purpose computer (such as computer 20), although a specialized computer could be
developed for this specific purpose. The general-purpose computer, though, has the
added advantage of being adaptable to varying uses in addition to operating within the
present system 100. In general, the general-purpose computer should have, among other
elements, a microprocessor (such as the Intel Corporation PENTIUM, Cyrix K6 or
Motorola 68000 series); volatile and non-volatile memory; one or more storage
devices (i.e. HDD (not shown), floppy drive 21, and other removable media devices 22
such as a CD-ROM drive, DITTO, ZIP or JAZ drive (from Iomega Corporation) and the
like); various user input devices, such as a mouse 23, a keyboard 24, or a microphone 25;
and a video display system 26. In one embodiment, the general-purpose computer is
controlled by the WINDOWS 9.x operating system. It is contemplated, however, that
the present system would work equally well using a MACINTOSH computer or even
another operating system such as WINDOWS CE, UNIX or a JAVA based operating
system, to name a few.

Regardless of the particular computer platform used, in an embodiment utilizing
an analog audio input (via microphone 25) the general-purpose computer must include a
sound card (not shown). Of course, in an embodiment with a digital input no sound card
would be necessary.

In the embodiment shown in Fig. 1, digital audio recording stations 10, 11, 12
and 13 are loaded and configured to run digital audio recording software on a
PENTIUM-based computer system operating under WINDOWS 9.x. Such digital
recording software is available as a utility in the WINDOWS 9.x operating system or
from various third party vendors such as The Programmers' Consortium, Inc. of Oakton,
Virginia (VOICEDOC), Syntrillium Corporation of Phoenix, Arizona (COOL EDIT) or
Dragon Systems Corporation (Dragon Naturally Speaking Professional Edition). These
various software programs produce a voice dictation file in the form of a "WAV" file.
However, as would be known to those skilled in the art, other audio file formats, such as
MP3 or DSS, could also be used to format the voice dictation file, without departing
from the spirit of the present invention. In one embodiment where VOICEDOC software
is used, that software also automatically assigns a file handle to the WAV file; however, it
would be known to those of ordinary skill in the art to save an audio file on a computer
system using standard operating system file management methods.

Another means for receiving a voice dictation file is dedicated digital recorder 14,
such as the Olympus Digital Voice Recorder D-1000 manufactured by the Olympus
Corporation. Thus, if the current voice user is more comfortable with a more
conventional type of dictation device, they can continue to use a dedicated digital
recorder 14. In order to harvest the digital audio text file, upon completion of a
recording, dedicated digital recorder 14 would be operably connected to one of the
digital audio recording stations, such as 13, toward downloading the digital audio file
into that general-purpose computer. With this approach, for instance, no audio card
would be required.

Another alternative for receiving the voice dictation file may consist of using one
form or another of removable magnetic media containing a pre-recorded audio file. With
this alternative an operator would input the removable magnetic media into one of the
digital audio recording stations toward uploading the audio file into the system.

In some cases it may be necessary to pre-process the audio files to make them
acceptable for processing by the speech recognition software. For instance, a DSS file
format may have to be changed to a WAV file format, or the sampling rate of a digital
audio file may have to be upsampled or downsampled. For instance, in use of the Olympus
Digital Voice Recorder with Dragon Naturally Speaking, Olympus' 8 kHz rate needs to
be upsampled to 11 kHz. Software to accomplish such pre-processing is available from
a variety of sources including Syntrillium Corporation and Olympus Corporation.
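
To make the idea of changing a sampling rate concrete, the sketch below resamples a list of samples by linear interpolation. It is a deliberately simplified illustration and not what the named commercial software does: production resamplers normally apply filtering to avoid aliasing. The function name `resample_linear` is hypothetical, and the rates in the usage note read the passage's 8 and 11 figures as 8,000 Hz and 11,000 Hz, which is an assumption.

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample by linear interpolation between neighbouring input samples."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if not samples:
        return []
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate  # position of output sample k on the input grid
        i = int(pos)
        frac = pos - i
        if i + 1 < len(samples):
            out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        else:
            out.append(samples[-1])  # past the last interval: hold the final sample
    return out
```

Calling `resample_linear(one_second_of_audio, 8000, 11000)` on 8,000 input samples yields 11,000 output samples, which is the kind of upsampling step the passage refers to.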

The other aspect of the digital audio recording stations is some means for
identifying the current voice user. The identifying means may include keyboard 24 upon
which the user (or a separate operator) can input the current user's unique identification
code. Of course, the user identification can be input using a myriad of computer input
devices such as pointing devices (e.g. mouse 23), a touch screen (not shown), a light pen
(not shown), bar-code reader (not shown) or audio cues via microphone 25, to name a
few.

In the case of a first time user the identifying means may also assign that user an
identification number after receiving potentially identifying information from that user
including: (1) name; (2) address; (3) occupation; (4) vocal dialect or accent; etc. As
discussed in association with the control means, based upon this input information, a
voice user profile and a sub-directory within the control means are established. Thus,
regardless of the particular identification means used, a user identification must be
established for each voice user and subsequently provided with a corresponding digital
audio file for each use such that the control means can appropriately route and the system
ultimately transcribe the audio.

In one embodiment of the present invention, the identifying means may also seek
the manual selection of a specialty vocabulary. It is contemplated that the specialty
vocabulary sets may be general for various users such as medical (i.e. Radiology,
Orthopedic Surgery, Gynecology) and legal (i.e. corporate, patent, litigation) or highly
specific such that within each specialty the vocabulary parameters could be further
limited based on the particular circumstances of a particular dictation file. For instance,
if the current voice user is a Radiologist dictating the reading of an abdominal CAT scan,
the nomenclature is highly specialized and different from the nomenclature for a renal
ultrasound. By narrowly segmenting each selectable vocabulary set an increase in the
accuracy of the automatic speech converter is likely.

As shown in Fig. 1, the digital audio recording stations may be operably
connected to system 100 as part of computer network 30 or, alternatively, they may be
operably connected to the system via internet host 15. As shown in Fig. 1b, the general-
purpose computer can be connected to both network jack 27 and telephone jack. With
the use of an internet host, connection may be accomplished by e-mailing the audio file
via the Internet. Another method for completing such connection is by way of direct
modem connection via remote control software, such as PC ANYWHERE, which is
available from Symantec Corporation of Cupertino, California. It is also possible, if the
IP address of digital audio recording station 10 or internet host 15 is known, to transfer
the audio file using basic file transfer protocol. Thus, as can be seen from the foregoing,
the present system allows great flexibility for voice users to provide audio input into the
system.

Control means 200 controls the flow of voice dictation file based upon the
training status of the current voice user. As shown in Figs. 2a, 2b, 2c, 2d, control means
200 comprises a software program operating on general-purpose computer 40. In
particular, the program is initialized in step 201 where variables are set, buffers cleared
and the particular configuration for this particular installation of the control means is
loaded. Control means continually monitors a target directory (such as "current" (shown
in Fig. 3)) to determine whether a new file has been moved into the target, step 202.
Once a new file is found (such as "6723.id" (shown in Fig. 3)), a determination is made
as to whether or not the current user 5 (shown in Fig. 1) is a new user, step 203.

For each new user (as indicated by the existence of a ".pro" file in the "current"
subdirectory), a new subdirectory is established, step 204 (such as the "usern"
subdirectory (shown in Fig. 3)). This subdirectory is used to store all of the audio files
("xxxx.wav"), written text ("xxxx.wrt"), verbatim text ("xxxx.vb"), transcription text
("xxxx.txt") and user profile ("usern.pro") for that particular user. Each particular job is
assigned a unique number "xxxx" such that all of the files associated with a job can be
associated by that number. With this directory structure, the number of users is
practically limited only by storage space within general-purpose computer 40.
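
The directory monitoring of step 202 and the per-job file naming can be sketched together as follows. This is a minimal illustration under stated assumptions: the function names `find_new_files` and `job_files` are hypothetical, polling with `os.listdir` is only one way to watch a directory, and the ".wrt" suffix for the written text is reconstructed from a poorly legible original and should be treated as uncertain.

```python
import os
from pathlib import PurePath

# File suffixes as read from the passage; ".wrt" is an uncertain reconstruction.
JOB_SUFFIXES = {
    "audio": ".wav",
    "written_text": ".wrt",
    "verbatim_text": ".vb",
    "transcribed_text": ".txt",
}


def find_new_files(target_dir: str, seen: set[str]) -> list[str]:
    """Return names in target_dir not seen on earlier calls, and remember them."""
    current = set(os.listdir(target_dir))
    new_files = sorted(current - seen)
    seen.update(current)
    return new_files


def job_files(user_dir: str, job_number: str) -> dict[str, PurePath]:
    """Build the paths that share one job number inside a user's subdirectory."""
    return {
        kind: PurePath(user_dir) / f"{job_number}{suffix}"
        for kind, suffix in JOB_SUFFIXES.items()
    }
```

A main loop could call `find_new_files` on the "current" directory at intervals and, for each new job, use `job_files("usern", "6723")` to locate every file associated with job 6723.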



Now that the user subdirectory has been established, the user profile is moved to
the subdirectory, step 205. The contents of this user profile may vary between systems.
The contents of one potential user profile are shown in Fig. 3 as containing: the user name,
address, occupation and training status. Aside from the training status variable, which is
necessary, the other data is useful in routing and transcribing the audio files.

The control means, having selected one set of files by the handle, determines the
identity of the current user by comparing the ".id" file with its "user.tbl," step 206. Now
that the user is known, the user profile may be parsed from that user's subdirectory and
the current training status determined, step 207. Steps 208-211 are the triage of the
current training status, which is one of: enrollment, training, automate, and stop automation.
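
Parsing the profile and checking the training status (step 207) might look like the sketch below. The on-disk format of the profile is not given in this passage, so the "key=value" layout, the key name "training status", and both function names are assumptions made purely for illustration.

```python
# The four statuses named in the passage for the triage of steps 208-211.
VALID_STATUSES = ("enrollment", "training", "automate", "stop automation")


def parse_profile(text: str) -> dict[str, str]:
    """Parse a hypothetical 'key=value' user profile into a dictionary."""
    profile = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        profile[key.strip().lower()] = value.strip()
    return profile


def training_status(profile: dict[str, str]) -> str:
    """Return the validated training status stored in a parsed profile."""
    status = profile.get("training status", "").lower()
    if status not in VALID_STATUSES:
        raise ValueError(f"missing or unknown training status: {status!r}")
    return status
```

The returned status would then select which of the enrollment, training, automation or stop-automation branches the control means follows.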

Enrollment is the first stage in automating transcription services. As shown in
Fig. 2b, the audio file is sent to transcription, step 301. In particular, the "xxxx.wav" file
is transferred to transcriptionist stations 50 and 51. In a preferred embodiment, both
stations are general-purpose computers, which run both an audio player and manual input
means. The audio player is likely to be a digital audio player, although it is possible that
an analog audio file could be transferred to the stations. Various audio players are
commonly available including a utility in the WINDOWS 9.x operating system and
various other third parties such as The Programmers' Consortium, Inc. of Oakton,
Virginia (VOICESCRIBE). Regardless of the audio player used to play the audio file,
manual input means is running on the computer at the same time. This manual input
means may comprise any text editor or word processor (such as MS WORD,
WordPerfect, AmiPro or Word Pad) in combination with a keyboard, mouse, or other
user-interface device. In one embodiment of the present invention, this manual input
means may, itself, also be speech recognition software, such as Naturally Speaking from
Dragon Systems of Newton, Massachusetts, Via Voice from IBM Corporation of
Armonk, New York, or Speech Magic from Philips Corporation of Atlanta, Georgia.
Human transcriptionist 6 listens to the audio file created by current user 5 and, as is
known, manually inputs the perceived contents of that recorded text, thus establishing
the transcribed file, step 302. Being human, human transcriptionist 6 is likely to impose
experience, education and biases on the text and thus not input a verbatim transcript of
the audio file. Upon completion of the human transcription, the human transcriptionist 6
saves the file and indicates that it is ready for transfer to the current user's subdirectory as
"xxxx.txt", step 303.

Inasmuch as this current user is only at the enrollment stage, a human operator
will have to listen to the audio file and manually compare it to the transcribed file and
create a verbatim file, step 304. That verbatim file "xxxx.vb" is also transferred to the
current user's subdirectory, step 305. Now that verbatim text is available, control means
200 starts the automatic speech conversion means, step 306. This automatic speech
conversion means may be a preexisting program, such as Dragon Systems' Naturally
Speaking, IBM's Via Voice or Philips' Speech Magic, to name a few. Alternatively, it
could be a unique program that is designed to specifically perform automated speech
recognition.

In a preferred embodiment, Dragon Systems' Naturally Speaking has been used
by running an executable simultaneously with Naturally Speaking that feeds phantom
keystrokes and mousing operations through the WIN32API, such that Naturally
Speaking believes that it is interacting with a human being, when in fact it is being
controlled by control means 200. Such techniques are well known in the computer
software testing art and, thus, will not be discussed in detail. It should suffice to say that
by watching the application flow of any speech recognition program, an executable to
mimic the interactive manual steps can be created.

If the current user is a new user, the speech recognition program will need to
establish the new user, step 307. Control means provides the necessary information from
the user profile found in the current user's subdirectory. All speech recognition programs
require significant training to establish an acoustic model of a particular user. In the case
of Dragon, initially the program seeks approximately 20 minutes of audio usually obtained
by the user reading a canned text provided by Dragon Systems. There is also
functionality built into Dragon that allows "mobile training." Using this feature, the
verbatim file and audio file are fed into the speech recognition program to begin
training the acoustic model for that user, step 308. Regardless of the length of that audio
file, control means 200 closes the speech recognition program at the completion of the
file, step 309.

As the enrollment step is too soon to use the automatically created text, a copy of
the transcribed file is sent to the current user using the address information contained in
the user profile, step 310. This address can be a street address or an e-mail address.
Following that transmission, the program returns to the main loop on Fig. 2a.

After a certain number of minutes of training have been conducted for a
particular user, that user's training status may be changed from enrollment to training.
The border for this change is subjective, but perhaps a good rule of thumb is once
Dragon appears to be creating written text with 80% accuracy or more, the switch
between states can be made. Thus, for such a user the next transcription event will
prompt control means 200 into the training state. As shown in Fig. 2c, steps 401-403 are
the same human transcription steps as steps 301-303 in the enrollment phase. Once the
transcribed file is established, control means 200 starts the automatic speech conversion
means (or speech recognition program) and selects the current user, step 404. The audio
file is fed into the speech recognition program and a written text is established within the
program buffer, step 405. In the case of Dragon, this buffer is given the same file handle
on every instance of the program. Thus, that buffer can be easily copied using standard
operating system commands and manual editing can begin, step 406.
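
The rule of thumb for moving a user from enrollment to training can be written as a simple threshold check. The passage calls this boundary subjective, so the sketch below is only one possible encoding: the function name `next_status`, the representation of accuracy as a fraction between 0 and 1, and the idea of applying the rule automatically are all assumptions.

```python
def next_status(current: str, accuracy: float, threshold: float = 0.80) -> str:
    """Promote an enrollment-stage user to training once accuracy reaches the threshold."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    if current == "enrollment" and accuracy >= threshold:
        return "training"
    return current  # all other statuses are left unchanged by this rule
```

With the default threshold, a user at 0.79 stays in enrollment while a user at 0.80 moves to training on the next transcription event.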

In one particular embodiment utilizing the VOICEWARE system from The
Programmers' Consortium, Inc. of Oakton, Virginia, the user inputs audio into the
VOICEWARE system's VOICEDOC program, thus creating a ".wav" file. In addition,
before releasing this ".wav" file to the VOICEWARE server, the user selects a
"transcriptionist." This "transcriptionist" may be a particular human transcriptionist or
may be the "computerized transcriptionist." If the user selects a "computerized
transcriptionist" they may also select whether that transcription is handled locally or
remotely. This file is assigned a job number by the VOICEWARE server, which routes
the job to the VOICESCRIBE portion of the system. Normally, VOICESCRIBE is used
by the human transcriptionist to receive and play back the job's audio (".wav") file. In
addition, the audio file is grabbed by the automatic speech conversion means. In this
VOICEWARE system embodiment, by placing VOICESCRIBE in "auto mode" new
jobs (i.e. an audio file newly created by VOICEDOC) are automatically downloaded
from the VOICEWARE server and a VOICESCRIBE window opens having a window title
formed by the job number of the current ".wav" file. An executable file, running in the
background, "sees" the VOICESCRIBE window open and using the WIN32API
determines the job number from the VOICESCRIBE window title. The executable file
then launches the automatic speech conversion means. In Dragon Systems' Naturally
Speaking, for instance, there is a built-in function for performing speech recognition on a
preexisting ".wav" file. The executable program feeds phantom keystrokes to Naturally
Speaking to open the ".wav" file from the "current" directory (see Fig. 3) having the job
number of the current job.

In this embodiment, after Naturally Speaking has completed automatically
transcribing the contents of the ".wav" file, the executable file resumes operation by
selecting all of the text in the open Naturally Speaking window and copying it to the
WINDOWS 9.x operating system clipboard. Then, using the clipboard utility, it saves the
clipboard as a text file using the current job number with a "dmt" suffix. The executable
file then "clicks" the "complete" button in VOICESCRIBE to return the "dmt" file to the
VOICEWARE server. As would be understood by those of ordinary skill in the art, the
foregoing procedure can be done utilizing other digital recording software and other
automatic speech conversion means. Additionally, functionality analogous to the
WINDOWS clipboard exists in other operating systems. It is also possible to require
human intervention to activate or prompt one or more of the foregoing steps. Further,
although the various programs executing various steps of this could be running on a
number of interconnected computers (via a LAN, WAN, internet connectivity, email and
the like), it is also contemplated that all of the necessary software can be running on a
single computer.
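The job-number handoff in this embodiment can be illustrated with a short Python sketch. It is not the VOICEWARE implementation: the window-title format, the helper names `job_number_from_title` and `save_dmt`, and the directory layout are assumptions made for illustration. The description states only that the window title is formed from the job number and that the text is saved under that job number with a "dmt" suffix.

```python
import re
import tempfile
from pathlib import Path


def job_number_from_title(title: str) -> str:
    """Return the first run of digits in a window title (assumed title format)."""
    match = re.search(r"\d+", title)
    if match is None:
        raise ValueError(f"no job number in window title: {title!r}")
    return match.group(0)


def save_dmt(current_dir: Path, job_number: str, text: str) -> Path:
    """Save the automatically transcribed text as <job_number>.dmt."""
    target = current_dir / f"{job_number}.dmt"
    target.write_text(text, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Hypothetical window title; a real title would be read via the WIN32API.
    with tempfile.TemporaryDirectory() as tmp:
        job = job_number_from_title("VOICESCRIBE - 3333")
        path = save_dmt(Path(tmp), job, "text copied from the recognizer window")
        print(path.name)
```

Reading the real window title and sending the phantom keystrokes would require platform-specific calls that are deliberately left out of this sketch.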

Another alternative approach is also contemplated wherein the user dictates
directly into the automatic speech conversion means and the VOICEWARE server picks
up a copy in the reverse direction. This approach works as follows: without actually
recording any voice, the user clicks on the "complete" button in VOICEDOC, thus
creating an empty ".wav" file. This empty file is nevertheless assigned a unique job
number by the VOICEWARE server. The user (or an executable file running in the
background) then launches the automatic speech conversion means and the user dictates
directly into that program, in the same manner previously used in association with such
automatic speech conversion means. Upon completion of the dictation, the user presses
a button labeled "return" (generated by a background executable file), which executable
then commences a macro that gets the current job number from VOICEWARE (in the
manner described above), selects all of the text in the document and copies it to the
clipboard. The clipboard is then saved to the file "<jobnumber>.dmt," as discussed
above. The executable then "clicks" the "complete" button (via the WIN32API) in
VOICESCRIBE, which effectively returns the automatically transcribed text file back to
the VOICEWARE server, which, in turn, returns the completed transcription to the
VOICESCRIBE user. Notably, although the various programs executing various steps
of this could be running on a number of interconnected computers (via a LAN, WAN,
internet connectivity, email and the like), it is also contemplated that all of the necessary
software can be running on a single computer. As would be understood by those of
ordinary skill in the art, the foregoing procedure can be done utilizing other digital
recording software and other automatic speech conversion means. Additionally,
functionality analogous to the WINDOWS clipboard exists in other operating systems. It
is also possible to require human intervention to activate or prompt one or more of the
foregoing steps.

Manual editing is not an easy task. Human beings are prone to errors. Thus, the
present invention also includes means for improving on that task. As shown in Fig. 4,
the transcribed file ("3333.txt") and the copy of the written text ("3333.wrt") are
sequentially compared word by word 406a toward establishing a sequential list of
unmatched words 406b that are culled from the copy of the written text. This list has a
beginning and an end and a pointer 406c to the current unmatched word. Underlying the
sequential list is another list of objects which contains the original unmatched words, as
well as the words immediately before and after that unmatched word, the starting
location in memory of each unmatched word in the sequential list of unmatched words
406b and the length of the unmatched word.
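The word-by-word comparison and the underlying list of objects can be sketched in Python as follows. This is an illustration under stated assumptions, not the comparison routine of the described system: the alignment uses the standard library's `difflib.SequenceMatcher`, the names `Unmatched` and `unmatched_words` are hypothetical, and a character offset into the written text stands in for the "starting location in memory."

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass(frozen=True)
class Unmatched:
    word: str    # original unmatched word from the written text
    before: str  # word immediately before it ("" at the start)
    after: str   # word immediately after it ("" at the end)
    start: int   # character offset of the word in the written text
    length: int  # length of the unmatched word


def unmatched_words(transcribed: str, written: str) -> List[Unmatched]:
    """List the words of `written` that do not line up with `transcribed`."""
    ref = transcribed.split()
    hyp = written.split()

    # Record the character offset of every word in the written text.
    offsets, pos = [], 0
    for w in hyp:
        pos = written.index(w, pos)
        offsets.append(pos)
        pos += len(w)

    result = []
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # The list is culled from the written text, so only its words are kept.
        for j in range(j1, j2):
            result.append(Unmatched(
                word=hyp[j],
                before=hyp[j - 1] if j > 0 else "",
                after=hyp[j + 1] if j + 1 < len(hyp) else "",
                start=offsets[j],
                length=len(hyp[j]),
            ))
    return result


if __name__ == "__main__":
    for item in unmatched_words("take out the trash today",
                                "take out the cash today"):
        print(item)
```

In this sketch the comparison is case- and punctuation-sensitive, and words that the recognizer dropped entirely produce no entry because the list is drawn only from the written text.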

As shown in Fig. 5, the unmatched word pointed at by pointer 406c from list
406b is displayed in substantial visual isolation from the other text in the copy of the
written text on a standard computer monitor 500 in an active window 501. As shown in
Fig. 5, the context of the unmatched word can be selected by the operator to be shown
within the sentence it resides, word by word or in phrase context, by clicking on buttons
514, 515, and 516, respectively.

Associated with active window 501 is background window 502, which contains
the copy of the written text file. As shown in background window 502, an incremental
search has located (see pointer 503) the next occurrence of the current unmatched word
"cash." Contemporaneously therewith, within window 505 containing the buffer from
the speech recognition program, the same incremental search has located (see pointer
506) the next occurrence of the current unmatched word. A human user, who will likely
only be viewing active window 501, can activate the audio replay from the speech recognition
program by clicking on "play" button 510, which plays the audio synchronized to the
text at pointer 506. Based on that snippet of speech, which can be played over and over
by clicking on the play button, the human user can manually input the correction to the
current unmatched word via keyboard, mousing actions, or possibly even audible cues to
another speech recognition program running within this window.
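The contemporaneous incremental search in the two windows can be pictured with the following Python sketch. It is only an assumption about how such a search might be written: the function names `find_next` and `search_both`, the whole-word matching rule, and the use of character offsets for pointers 503 and 506 are all hypothetical.

```python
import re
from typing import Optional, Tuple


def find_next(buffer: str, word: str, start: int = 0) -> Optional[int]:
    """Return the offset of the next whole-word occurrence at or after `start`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    match = pattern.search(buffer, start)
    return match.start() if match else None


def search_both(copy_text: str, recognizer_buffer: str, word: str,
                copy_pos: int = 0, buffer_pos: int = 0
                ) -> Tuple[Optional[int], Optional[int]]:
    """Run the same search in both buffers; the results play the role of
    pointer 503 (copy of the written text) and pointer 506 (recognizer buffer)."""
    return (find_next(copy_text, word, copy_pos),
            find_next(recognizer_buffer, word, buffer_pos))


if __name__ == "__main__":
    text = "put the cash in the trash; cash only"
    print(search_both(text, text, "cash"))
```

Keeping a separate position for each buffer lets the two pointers drift apart once corrections change the length of the copy of the written text.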

In the present example, even with the choice of isolated context offered by buttons
514, 515 and 516, it may still be difficult to determine the correct verbatim word out of
context. Accordingly, there is a switch window button 513 that will move background
window 502 to the foreground with visible pointer 503 indicating the current location
within the copy of the written text. The user can then return to the active window and
input the correct word, "trash." This change will only affect the copy of the written text
displayed in background window 502.

When the operator is ready for the next unmatched word, the operator clicks on
the advance button 511, which advances pointer 406c down the list of unmatched words
and activates the incremental search in both window 502 and 505. This unmatched word
is now displayed in isolation and the operator can play the synchronized speech from the
speech recognition program and correct this word as well. If at any point in the
operation, the operator would like to return to a previous unmatched word, the operator
clicks on the reverse button 512, which moves pointer 406c back a word in the list and
causes a backward incremental search to occur. This is accomplished by using the
underlying list of objects which contains the original unmatched words. This list is
traversed in object by object fashion, but alternatively each of the records could be
padded such that each item has the same word size to assist in bi-directional traversing of
the list. As the unmatched words in this underlying list are read only, it is possible to
return to the original unmatched word such that the operator can determine if a different
correction should have been made.
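The advance and reverse behavior of pointer 406c, together with the retention of the original unmatched words, can be sketched as below. The class name `UnmatchedCursor` and its methods are hypothetical; the sketch assumes the originals are held in an immutable tuple while the operator's corrections live in a separate editable list.

```python
from typing import List


class UnmatchedCursor:
    """Bi-directional cursor over a list of unmatched words."""

    def __init__(self, originals: List[str]) -> None:
        self._originals = tuple(originals)  # never modified after creation
        self.corrections = list(originals)  # operator's edits go here
        self.index = 0                      # plays the role of pointer 406c

    def advance(self) -> bool:
        """Move to the next unmatched word; False if already at the end."""
        if self.index + 1 < len(self._originals):
            self.index += 1
            return True
        return False

    def reverse(self) -> bool:
        """Move back one unmatched word; False if already at the beginning."""
        if self.index > 0:
            self.index -= 1
            return True
        return False

    def correct(self, word: str) -> None:
        """Record the operator's correction for the current word."""
        self.corrections[self.index] = word

    @property
    def original(self) -> str:
        """The original unmatched word at the current position."""
        return self._originals[self.index]


if __name__ == "__main__":
    cursor = UnmatchedCursor(["cash", "their"])
    cursor.correct("trash")
    cursor.advance()
    cursor.reverse()
    print(cursor.original, "->", cursor.corrections[cursor.index])
```

Because the originals are stored separately, stepping back always shows what the recognizer produced, so the operator can judge whether a different correction should have been made.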

Ultimately, the copy of the written text is finally corrected resulting in a verbatim
copy, which is saved to the user's subdirectory. The verbatim file is also passed to the
speech recognition program for training, step 407. The new (and improved) acoustic
model is saved, step 408, and the speech recognition program is closed, step 409. As the
system is still in training, the transcribed file is returned to the user, as in step 310 from
the enrollment phase.

As shown in Fig. 4, the system may also include means for determining the
accuracy rate from the output of the sequential comparing means. Specifically, by
counting the number of words in the written text and the number of words in list 406b
the ratio of words in said sequential list to words in said written text can be determined,
thus providing an accuracy percentage. As before, it is a matter of choice when to
advance users from one stage to another. Once that goal is reached, the user's profile is
changed to the next stage, step 211.
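A worked version of this calculation might look like the Python sketch below. Two points are assumptions rather than statements from the description: accuracy is taken as the complement of the unmatched-to-total ratio, and the 80% figure, which the description offers only as a rule of thumb for leaving enrollment, is used as a default threshold. The function names are hypothetical.

```python
def accuracy_percentage(unmatched_count: int, written_word_count: int) -> float:
    """Accuracy as the share of written-text words that were not unmatched."""
    if written_word_count <= 0:
        raise ValueError("written text must contain at least one word")
    if not 0 <= unmatched_count <= written_word_count:
        raise ValueError("unmatched count must lie between 0 and the word count")
    matched = written_word_count - unmatched_count
    return 100.0 * matched / written_word_count


def ready_to_advance(accuracy: float, threshold: float = 80.0) -> bool:
    """True when the accuracy meets the (assumed) threshold for the next stage."""
    return accuracy >= threshold


if __name__ == "__main__":
    acc = accuracy_percentage(unmatched_count=20, written_word_count=100)
    print(acc, ready_to_advance(acc))
```

For example, 20 unmatched words in a 100-word written text give a ratio of 0.20 and therefore an accuracy of 80%.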

One potential enhancement or derivative functionality is provided by the
determination of the accuracy percentage. In one embodiment, this percentage could be
used to evaluate a human transcriptionist's skills. In particular, by using either a known
verbatim file or a well-established user, the associated ".wav" file would be played for
the human transcriptionist and the foregoing comparison would be performed on the
transcribed text versus the verbatim file created by the foregoing process. In this
manner, additional functionality can be provided by the present system.

As understood, currently, manufacturers of speech recognition programs use
recordings of foreign languages, dictions, etc. with manually established verbatim files to
program speech models. It should be readily apparent that the foregoing manner of
establishing verbatim text could be used in the initial development of these speech files,
simplifying this process greatly.

Once the user has reached the automation stage, the greatest benefits of the
present system can be achieved. The speech recognition software is started, step 600,
and the current user selected, step 601. If desired, a particularized vocabulary may be
selected, step 602. Then automatic conversion of the digital audio file recorded by the
current user may commence, step 603. When completed, the written file is transmitted to
the user based on the information contained in the user profile, step 604, and the program
is returned to the main loop.

Unfortunately, there may be instances where the voice users cannot use
automated transcription for a period of time (during an illness, after dental work, etc.)
because their acoustic model has been temporarily (or even permanently) altered. In that
case, the system administrator may set the training status variable to a stop automation
state in which steps 301, 302, 303, 305 and 310 (see Fig. 2b) are the only steps
performed.
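The way the training status steers each dictation job can be summarized in a small Python sketch. It is a simplified reading of the states described above, not the control means itself: the names `TrainingStatus`, `Route` and `route_for` are hypothetical, and treating the stop automation state as manual transcription followed by return of the transcribed file is an interpretation of the listed steps.

```python
from dataclasses import dataclass
from enum import Enum


class TrainingStatus(Enum):
    ENROLLMENT = "enrollment"
    TRAINING = "training"
    AUTOMATION = "automation"
    STOP_AUTOMATION = "stop automation"


@dataclass(frozen=True)
class Route:
    manual_transcription: bool  # a human transcriptionist types the dictation
    automatic_text: bool        # the recognizer's written text is produced and used
    train_model: bool           # the verbatim file is used to train the acoustic model
    returned_file: str          # what goes back to the voice user


# Assumed mapping of each status to the work performed for one dictation file.
ROUTES = {
    TrainingStatus.ENROLLMENT: Route(True, False, True, "transcribed file"),
    TrainingStatus.TRAINING: Route(True, True, True, "transcribed file"),
    TrainingStatus.AUTOMATION: Route(False, True, False, "written text"),
    TrainingStatus.STOP_AUTOMATION: Route(True, False, False, "transcribed file"),
}


def route_for(status: TrainingStatus) -> Route:
    """Look up the handling for a voice user's current training status."""
    return ROUTES[status]


if __name__ == "__main__":
    for status in TrainingStatus:
        print(status.value, route_for(status))
```

A table of this kind keeps the routing decision in one place, so moving a user between stages only requires changing the status stored in the user profile.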

The foregoing description and drawings merely explain and illustrate the
invention and the invention is not limited thereto. Those of skill in the art who have
the disclosure before them will be able to make modifications and variations therein
without departing from the scope of the present invention. For instance, it is possible to
implement all of the elements of the present system on a single general-purpose
computer by essentially time sharing the machine between the voice user, transcriptionist
and the speech recognition program. The resulting cost saving makes this system
accessible to more types of office situations, not simply large medical clinics, hospitals,
law firms or other large entities.

WHAT IS CLAIMED IS:

1. A system for substantially automating transcription services for one or more
voice users, said system comprising:

means for creating a uniquely identified voice dictation file from a current
user, said current user being one of said one or more voice users;

an audio player used to audibly reproduce said uniquely identified voice
dictation file;

means for manually inputting and creating a transcribed file based on
humanly perceived contents of said uniquely identified voice dictation
file;

means for automatically converting said uniquely identified voice
dictation file into written text;

means for manually editing a copy of said written text to create a verbatim
text of said uniquely identified voice dictation file;

means for training said automatic speech converting means to achieve
higher accuracy with said uniquely identified voice dictation file of
current user; and

means for controlling the flow of said uniquely identified voice dictation
file based upon a training status of said current user, whereby said
controlling means sends said uniquely identified voice dictation file to at
least one of said manual input means and said automatic speech
converting means.

2. The invention according to Claim 1 further comprising means for transferring
said written text into a written text file.

3. The invention according to Claim 1 wherein said written text is at least
temporarily synchronized to said uniquely identified voice dictation file, said
manual editing means comprises:

means for sequentially comparing a copy of said written text with said
transcribed file resulting in a sequential list of unmatched words culled
from said copy of said written text, said sequential list having a
beginning, an end and a current unmatched word, said current unmatched
word being successively advanced from said beginning to said end;

means for incrementally searching for said current unmatched word
contemporaneously within a first buffer associated with the speech
recognition program containing said written text and a second buffer
associated with said sequential list; and

means for correcting said current unmatched word in said second buffer,
said correcting means including means for displaying said current
unmatched word in a manner substantially visually isolated from other
text in said copy of said written text and means for playing a portion of
said synchronized voice dictation recording from said first buffer
associated with said current unmatched word.

4. The invention according to Claim 3 wherein said correcting means further
includes means for alternatively viewing said current unmatched word in context
within said copy of said written text.

5. The invention according to Claim 3 further including means for determining an
accuracy rate for said current user.

6. The invention according to Claim 5 wherein said verbatim file is a known
accurate file, said invention further includes means for determining skill of a human
operator based on said accuracy rate.

7. The invention according to Claim 4 wherein said sequential list and said written
text each have a respective number of words, said accuracy rate determining
means determines the ratio of said number of words in said sequential list to said
number of words in said written text.

8. The invention according to Claim 1 wherein said automatic speech converting 
means comprises a preexisting speech recognition program intended for human 
interactive use, said automatic speech converting means includes means for 
automating responses to a series of interactive inquiries from said preexisting 
speech recognition program. 

9. The invention according to Claim 8 wherein said training means comprises a
preexisting training portion of said preexisting speech recognition program
intended for human interactive use, said training means includes means for
automating responses to a series of interactive inquiries from said preexisting
training portion of said preexisting speech recognition program.

10. The invention according to Claim 1 wherein said training means comprises a
preexisting training program intended for human interactive use, said training
means includes means for automating responses to a series of interactive inquiries
from said preexisting training program.

11. The invention according to Claim 1 wherein said control means reads and
modifies a user profile associated with said current user, each of said user profiles
including said training status of said current user.



12. The invention according to Claim 11 wherein said training status is selected from
the group of pre-enrollment, enrollment, training, automation and stop
automation.



13. The invention according to Claim 12 when said training status is pre-enrollment
said control means further includes means for creating a user identification and
acoustic model within said automatic speech converting means.

14. The invention according to Claim 12 when said training status is enrollment said
control means routes said voice dictation file to said automatic speech converting
means and said manual input means, routes said written text and said transcribed
file to said manual editing means, routes said verbatim text to said training means
and routes said transcribed file back to said current user as a finished text.

15. The invention according to Claim 12 when said training status is training said
control means routes said voice dictation file to said automatic speech converting
means and said manual input means, routes said written text and said transcribed
file to said manual editing means, routes said verbatim text to said training means
and routes said transcribed file back to said current user as a finished text.

16. The invention according to Claim 12 when said training status is automation said
control means routes said voice dictation file only to said automatic speech
converting means and routes said written text back to said current user as a
finished text.

17. An apparatus for substantially simplifying the production of a foreign language
speech model for said speech recognition program wherein said foreign language
provides a sufficient set of words to teach the voice dictation recording based
upon a transcribed file produced by a human transcriptionist and a written text
produced by a speech recognition program, wherein said written text is at least
temporarily synchronized to said voice dictation recording, said apparatus
comprising:

means for sequentially comparing a copy of said written text with said
transcribed file resulting in a sequential list of unmatched words culled
from said copy of said written text, said sequential list having a
beginning, an end and a current unmatched word, said current unmatched
word being successively advanced from said beginning to said end;

means for incrementally searching for said current unmatched word
contemporaneously within a first buffer associated with the speech
recognition program containing said written text and a second buffer
associated with said sequential list; and

means for correcting said current unmatched word in said second buffer,
said correcting means including means for displaying said current
unmatched word in a manner substantially visually isolated from other
text in said copy of said written text and means for playing a portion of
said synchronized voice dictation recording from said first buffer
associated with said current unmatched word.

18. The invention according to Claim 17 wherein said correcting means further
includes means for alternatively viewing said current unmatched word in context
within said copy of said written text.

19. The invention according to Claim 18 wherein said manner substantially visually
isolated from other text can be manually selected from the group containing
word-by-word display, sentence-by-sentence display, and said current unmatched
word display.

20. A method for automating transcription services for one or more voice users in a
system including a manual transcription station and a speech recognition
program, said method comprising the steps of:

establishing a profile for each of the voice users, the profile containing a
training status;

creating a uniquely identified voice dictation file for a current voice user;

choosing the training status of the current voice user from the group of
enrollment, training, automated and stop automation;

routing the uniquely identified voice dictation file to at least one of the
manual transcription station and the speech recognition program based on
the training status;

receiving the uniquely identified voice dictation file in at least one of the
manual transcription station and the speech recognition program;

creating a transcribed file at the manual transcription station for each
received uniquely identified voice dictation file;

automatically creating a written text with the speech recognition program
for each received uniquely identified voice dictation file if the training
status of the current user is training or automated;

manually establishing a verbatim file if the training status of the current
user is enrollment or training;

training the speech recognition program with an acoustic model for the
current user using the verbatim file and the uniquely identified voice
dictation file if the training status of the current user is enrollment or
training;

returning the transcribed file to the current user if the training status of the
current user is enrollment or training; and

returning the written text to the current user if the training status of the
current user is automated.

21. The invention according to Claim 20 wherein said step of manually establishing a
verbatim file includes the sub-steps of:

assisting an operator to establish the verbatim file if the training status of
the current user is training by:

sequentially comparing a copy of the written text with the
transcribed file resulting in a sequential list of unmatched words
culled from the copy of the written text, the sequential list having
a beginning, an end and a current unmatched word, the current
unmatched word being successively advanced from the beginning
to the end;

incrementally searching for the current unmatched word
contemporaneously within a first buffer associated with the speech
recognition program containing the written text and a second
buffer associated with the sequential list; and

displaying the current unmatched word in a manner substantially
visually isolated from other text in the copy of the written text and
playing a portion of the synchronized voice dictation recording
from the first buffer associated with the current unmatched word;
and

correcting the current unmatched word to be a verbatim
representation of the portion of the synchronized voice dictation
recording.

22. A method for testing the skills of a human transcriptionist using a known accurate
written text created by a speech recognition program and a transcribed file
created by the human transcriptionist, the method comprising:

sequentially comparing a copy of the written text with the
transcribed file resulting in a sequential list of unmatched words
culled from the copy of the written text, the sequential list having
a beginning, an end and a current unmatched word, the current
unmatched word being successively advanced from the beginning
to the end;

incrementally searching for the current unmatched word
contemporaneously within a first buffer associated with the speech
recognition program containing the written text and a second
buffer associated with the sequential list; and

displaying the current unmatched word in a manner substantially
visually isolated from other text in the copy of the written text and
playing a portion of the synchronized voice dictation recording
from the first buffer associated with the current unmatched word;
and

calculating the accuracy rate of the human transcriptionist.